Is data exploration a hassle? Try yellowbrick, a remarkably capable visualization tool for feature analysis
Preface
# Load the classification data set
data = load_data("occupancy")
# Specify the features of interest and the classes of the target
features = ["temperature", "relative humidity", "light", "C02", "humidity"]
classes = ["unoccupied", "occupied"]
# Extract the instances and target
X = data[features]
y = data.occupancy
# Import the visualizer
from yellowbrick.features import RadViz
# Instantiate the visualizer
visualizer = RadViz(classes=classes, features=features)
visualizer.fit(X, y) # Fit the data to the visualizer
visualizer.transform(X) # Transform the data
visualizer.poof() # Draw/show/poof the data
from yellowbrick.features import Rank1D
# Instantiate the 1D visualizer with the Shapiro ranking algorithm
visualizer = Rank1D(features=features, algorithm='shapiro')
visualizer.fit(X, y) # Fit the data to the visualizer
visualizer.transform(X) # Transform the data
visualizer.poof() # Draw/show/poof the data
# Load the regression data set
data = load_data('concrete')
# Specify the features of interest and the target
target = "strength"
features = [
'cement', 'slag', 'ash', 'water', 'splast', 'coarse', 'fine', 'age'
]
# Extract the instance data and the target
X = data[features]
y = data[target]
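# Import the visualizer (module path as in the yellowbrick release that still provides poof())
from yellowbrick.features.pca import PCADecomposition
visualizer = PCADecomposition(scale=True, proj_features=True)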
visualizer.fit_transform(X, y)
visualizer.poof()
import matplotlib.pyplot as plt
from sklearn.ensemble import GradientBoostingClassifier
from yellowbrick.features.importances import FeatureImportances
# Create a new matplotlib figure
fig = plt.figure()
ax = fig.add_subplot()
viz = FeatureImportances(GradientBoostingClassifier(), ax=ax)
viz.fit(X, y)
viz.poof()
Recursive Feature Elimination
from sklearn.svm import SVC
from sklearn.datasets import make_classification
from yellowbrick.features import RFECV
# Create a dataset with only 3 informative features
X, y = make_classification(
n_samples=1000, n_features=25, n_informative=3, n_redundant=2,
n_repeated=0, n_classes=8, n_clusters_per_class=1, random_state=0
)
# Create RFECV visualizer with linear SVM classifier
viz = RFECV(SVC(kernel='linear', C=1))
viz.fit(X, y)
viz.poof()
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold
# Load the credit data set (bound to `data`, which the lines below use)
data = load_data('credit')
target = 'default'
features = [col for col in data.columns if col != target]
X = data[features]
y = data[target]
cv = StratifiedKFold(5)
oz = RFECV(RandomForestClassifier(), cv=cv, scoring='f1_weighted')
oz.fit(X, y)
oz.poof()
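from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from yellowbrick.regressor import ResidualsPlot
# Split into train and test sets; X, y are assumed to hold a regression
# data set, such as the concrete data loaded earlier
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# Instantiate the linear model and visualizer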
ridge = Ridge()
visualizer = ResidualsPlot(ridge)
visualizer.fit(X_train, y_train) # Fit the training data to the model
visualizer.score(X_test, y_test) # Evaluate the model on the test data
visualizer.poof() # Draw/show/poof the data
import numpy as np
from sklearn.linear_model import LassoCV
from yellowbrick.regressor import AlphaSelection
# Create a list of alphas to cross-validate against
alphas = np.logspace(-10, 1, 400)
# Instantiate the linear model and visualizer
model = LassoCV(alphas=alphas)
visualizer = AlphaSelection(model)
visualizer.fit(X, y)
g = visualizer.poof()
Class Prediction Error
The class prediction error chart offers a quick way to see how good a classifier is at predicting the correct classes.
from sklearn.ensemble import RandomForestClassifier
from yellowbrick.classifier import ClassPredictionError
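# X_train, X_test, y_train, y_test and classes are assumed to come from a
# train/test split of a classification data set such as the occupancy data
# Instantiate the classification model and visualizer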
visualizer = ClassPredictionError(
RandomForestClassifier(), classes=classes
)
# Fit the training data to the visualizer
visualizer.fit(X_train, y_train)
# Evaluate the model on the test data
visualizer.score(X_test, y_test)
# Draw visualization
g = visualizer.poof()
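The tally behind a chart of this kind can be sketched in a few lines of plain Python. This is only an illustration of the idea, not yellowbrick's implementation: the helper `class_prediction_counts` and the label lists are hypothetical, and the grouping (one entry per actual class, broken down by predicted class) is an assumption about how the bars are stacked.

```python
from collections import Counter, defaultdict


def class_prediction_counts(y_true, y_pred):
    """For each actual class, count how often each class was predicted."""
    counts = defaultdict(Counter)
    for actual, predicted in zip(y_true, y_pred):
        counts[actual][predicted] += 1
    return {actual: dict(c) for actual, c in counts.items()}


# Hypothetical labels, invented for illustration only
y_true = ["occupied", "occupied", "unoccupied", "unoccupied", "unoccupied"]
y_pred = ["occupied", "unoccupied", "unoccupied", "unoccupied", "occupied"]

counts = class_prediction_counts(y_true, y_pred)
for actual, predicted in counts.items():
    print(actual, predicted)
```

Each entry corresponds to one stacked bar: the segment that matches the bar's own class is the correct predictions, and the other segments are the errors.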
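Visualizations of classification metrics are also available, including the confusion matrix, ROC/AUC, and precision/recall.
Discrimination Threshold (binary classification)
This visualizer plots precision, recall, F1 score, and queue rate against the discrimination threshold of a binary classifier. The discrimination threshold is the probability or score at which the positive class is chosen over the negative class. It is usually set to 50%, but it can be adjusted to raise or lower sensitivity to false positives or to other application factors.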
from sklearn.linear_model import LogisticRegression
from yellowbrick.classifier import DiscriminationThreshold
# Instantiate the classification model and visualizer
logistic = LogisticRegression()
visualizer = DiscriminationThreshold(logistic)
visualizer.fit(X, y) # Fit the training data to the visualizer
visualizer.poof() # Draw/show/poof the data
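To make the four curves concrete, here is a minimal sketch that computes them for a single threshold. The function `threshold_metrics`, the labels, and the scores are all hypothetical, and queue rate is taken to mean the fraction of instances predicted positive (those that would be queued for review).

```python
def threshold_metrics(y_true, scores, threshold):
    """Precision, recall, F1 and queue rate at one discrimination threshold."""
    predicted = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, predicted) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, predicted) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, predicted) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    # Fraction of all instances flagged as positive
    queue_rate = sum(predicted) / len(predicted)
    return precision, recall, f1, queue_rate


# Hypothetical true labels and classifier scores
y_true = [0, 0, 1, 0, 1, 1]
scores = [0.1, 0.4, 0.45, 0.6, 0.7, 0.9]

for threshold in (0.3, 0.5, 0.8):
    print(threshold, threshold_metrics(y_true, scores, threshold))
```

Raising the threshold in this toy data trades recall for precision and shrinks the queue, which is the trade-off the plot is meant to expose.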
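Elbow Method (clustering)
KElbowVisualizer implements the "elbow" method, which helps data scientists choose the optimal number of clusters by fitting the model with a range of values for K. If the line chart looks like an arm, the "elbow" (the point of inflection on the curve) is a good indication that the underlying model fits best at that point.
In the example below, KElbowVisualizer fits a KMeans model for K values from 4 to 11 on a sample dataset with 12 features and 8 random clusters of points. When the model is fit with 8 clusters, an "elbow" appears in the plot, which in this case we know to be the optimal number.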
from sklearn.datasets import make_blobs
# Create synthetic dataset with 8 random clusters
X, y = make_blobs(centers=8, n_features=12, shuffle=True, random_state=42)
from sklearn.cluster import KMeans
from yellowbrick.cluster import KElbowVisualizer
# Instantiate the clustering model and visualizer
model = KMeans()
visualizer = KElbowVisualizer(model, k=(4,12))
visualizer.fit(X) # Fit the data to the visualizer
visualizer.poof() # Draw/show/poof the data
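The quantity on the y-axis can be sketched without any library. The snippet below assumes the score is distortion (the sum of squared distances from each point to its assigned center) and uses a deliberately simple, hypothetical 1-D k-means with a deterministic start; it stands in for scikit-learn's KMeans and is not how yellowbrick computes the curve.

```python
def kmeans_1d(points, k, iterations=20):
    """Lloyd's algorithm on 1-D data; returns (centers, distortion)."""
    data = sorted(points)
    n = len(data)
    # Deterministic initialisation: k evenly spaced points of the sorted data
    centers = [data[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for x in data:
            nearest = min(range(k), key=lambda j: (x - centers[j]) ** 2)
            clusters[nearest].append(x)
        # Move each center to its cluster mean (keep it if the cluster is empty)
        centers = [
            sum(c) / len(c) if c else centers[j]
            for j, c in enumerate(clusters)
        ]
    # Distortion: squared distance from every point to its closest center
    distortion = sum(min((x - c) ** 2 for c in centers) for x in data)
    return centers, distortion


# Hypothetical data with three obvious groups
points = [0, 1, 2, 10, 11, 12, 20, 21, 22]
distortions = {k: kmeans_1d(points, k)[1] for k in range(1, 6)}
for k, score in distortions.items():
    print(k, round(score, 2))
```

On this toy data the distortion collapses when K reaches 3 and barely improves afterwards, and that bend is the elbow.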
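Intercluster Distance Maps
Intercluster distance maps show an embedding of the cluster centers in 2 dimensions that preserves their distances to the other centers. For example, the closer two centers are in the visualization, the closer they are in the original feature space. The clusters are sized according to a scoring metric. By default they are sized by membership, that is, the number of instances that belong to each center. This gives a sense of the relative importance of the clusters. Note, however, that two clusters overlapping in the 2D space does not imply that they overlap in the original feature space.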
from sklearn.datasets import make_blobs
# Make 12 blobs dataset
X, y = make_blobs(centers=12, n_samples=1000, n_features=16, shuffle=True)
from sklearn.cluster import KMeans
from yellowbrick.cluster import InterclusterDistance
# Instantiate the clustering model and visualizer
visualizer = InterclusterDistance(KMeans(9))
visualizer.fit(X) # Fit the training data to the visualizer
visualizer.poof() # Draw/show/poof the data
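The two ingredients of such a map, before any 2D embedding is applied, can be sketched as follows. The centers and labels are hypothetical, and the embedding step itself is left out.

```python
import math
from collections import Counter

# Hypothetical cluster centers and the cluster label of each instance
centers = {"a": (0.0, 0.0), "b": (3.0, 4.0), "c": (6.0, 8.0)}
labels = ["a", "a", "b", "c", "c", "c"]

# Membership: number of instances per center (drives the circle size)
membership = Counter(labels)

# Pairwise distances between centers (what the embedding tries to preserve)
distances = {
    (p, q): math.dist(centers[p], centers[q])
    for p in centers
    for q in centers
    if p < q
}

print(dict(membership))
print(distances)
```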
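Model Selection: Learning Curve
A learning curve shows how the model's training score relates to its cross-validated test score as the number of training samples changes. This visualization is typically used to show two things:
1. Whether the model benefits from more data
2. Whether the model is more sensitive to error due to bias or to error due to variance
yellowbrick can draw this learning curve visualization, and it applies to classification, regression, and clustering alike.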
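The numbers behind a learning curve can be sketched with a tiny least-squares model in plain Python. Everything here is an assumption made for illustration: the data, the fixed noise pattern, the single hold-out split (instead of cross-validation), and the use of mean squared error, where lower is better, instead of a score.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mse(model, xs, ys):
    """Mean squared error of a fitted line on the given points."""
    slope, intercept = model
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical data: y = 2x + 1 plus a fixed, repeating noise pattern
noise = [0.5, -0.4, 0.3, -0.2, 0.1, -0.5, 0.4, -0.3]
xs = list(range(24))
ys = [2 * x + 1 + noise[x % len(noise)] for x in xs]

# Hold out every third point for validation; train on growing prefixes of the rest
train = [(x, y) for x, y in zip(xs, ys) if x % 3 != 0]
valid = [(x, y) for x, y in zip(xs, ys) if x % 3 == 0]
valid_x, valid_y = zip(*valid)

sizes = [2, 4, 8, 16]
train_errors, valid_errors = [], []
for size in sizes:
    train_x, train_y = zip(*train[:size])
    model = fit_line(train_x, train_y)
    train_errors.append(mse(model, train_x, train_y))
    valid_errors.append(mse(model, valid_x, valid_y))

for size, tr, va in zip(sizes, train_errors, valid_errors):
    print(size, round(tr, 4), round(va, 4))
```

With two training points the line fits them perfectly but generalizes poorly; as the training set grows, the training error rises slightly while the validation error drops, which is the pattern a learning curve is meant to reveal.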
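Model Selection: Validation Curve
Model validation determines how effective a model is on the data it was trained on and how well it generalizes to new input. To measure a model's performance, we first split the dataset into training and test sets, fit the model on the training data, and score it on the held-out test data.
To maximize the score, the model's hyperparameters must be chosen so that the model operates as well as possible in the given feature space. Most models have several hyperparameters, and the best way to choose a combination of them is a grid search. Still, it is sometimes useful to plot the effect of a single hyperparameter on the training and test data to see whether the model underfits or overfits for some hyperparameter values.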
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from yellowbrick.model_selection import ValidationCurve
# Load a regression dataset
data = load_data('energy')
# Specify features of interest and the target
targets = ["heating load", "cooling load"]
features = [col for col in data.columns if col not in targets]
# Extract the instances and target
X = data[features]
y = data[targets[0]]
viz = ValidationCurve(
DecisionTreeRegressor(), param_name="max_depth",
param_range=np.arange(1, 11), cv=10, scoring="r2"
)
# Fit and poof the visualizer
viz.fit(X, y)
viz.poof()
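The same idea can be shown without scikit-learn by sweeping one hyperparameter by hand. The sketch below swaps in a hypothetical 1-D k-nearest-neighbors regressor for the decision tree and sweeps its `k`; the data, the noise pattern, the single hold-out split, and the use of mean squared error instead of R² are all assumptions made for illustration.

```python
def knn_predict(train, x, k):
    """Average the targets of the k training points nearest to x (1-D k-NN)."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    return sum(y for _, y in nearest) / k


def mse(train, points, k):
    """Mean squared error of k-NN predictions on the given points."""
    return sum((knn_predict(train, x, k) - y) ** 2 for x, y in points) / len(points)


# Hypothetical data: y = 0.5x plus a fixed, repeating noise pattern
noise = [0.5, -0.4, 0.3, -0.2, 0.1, -0.5, 0.4, -0.3]
data = [(x, 0.5 * x + noise[x % len(noise)]) for x in range(30)]
train = [p for p in data if p[0] % 3 != 0]
valid = [p for p in data if p[0] % 3 == 0]

# Sweep the single hyperparameter and record both errors
param_range = [1, 3, 5, 9, 15]
train_errors = [mse(train, train, k) for k in param_range]
valid_errors = [mse(train, valid, k) for k in param_range]

for k, tr, va in zip(param_range, train_errors, valid_errors):
    print(k, round(tr, 4), round(va, 4))
```

At `k=1` the model memorizes the training set (zero training error), the overfitting end of the curve; very large `k` smooths too much and both errors grow, the underfitting end.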
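Summary
In my view yellowbrick is a very good tool: first, it solves the visualization problems that come up in feature engineering and modeling and greatly simplifies the work; second, the visualizations can help uncover blind spots in one's own understanding of modeling.
This post covers only some of the visualization features available for modeling; for the full functionality, see: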
https://www.scikit-yb.org/en/latest/index.html